Digging into acceptor splice site prediction: an iterative feature selection approach
Feature selection techniques are often used to reduce data dimensionality, increase classification performance, and gain insight into the processes that generated the data. In this paper, we describe an iterative procedure of feature selection and feature construction steps, improving the classification of acceptor splice sites, an important subtask of gene prediction.
We show that acceptor prediction can benefit from feature selection, and describe how feature selection techniques can be used to gain new insights in the classification of acceptor sites. This is illustrated by the identification of a new, biologically motivated feature: the AG-scanning feature.
The results described in this paper contribute both to the domain of gene prediction and to research in feature selection techniques, describing a new wrapper-based feature weighting method that aids in knowledge discovery when dealing with complex datasets.
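The AG-scanning feature can be read as counting AG dinucleotides in the intronic region upstream of a candidate acceptor site, reflecting the scanning model in which the spliceosome selects the first AG downstream of the branch point. The sketch below is an illustrative assumption (the window size and exact definition are not taken from the paper):

```python
def ag_scanning_feature(sequence, site_index, window=50):
    """Count AG dinucleotides in the window upstream of a candidate
    acceptor site (positions site_index-window .. site_index-1).
    Window size 50 is an assumed default, not the paper's setting."""
    start = max(0, site_index - window)
    upstream = sequence[start:site_index]
    return sum(1 for i in range(len(upstream) - 1)
               if upstream[i:i + 2] == "AG")

# hypothetical intron fragment ending at a candidate acceptor "AG"
seq = "TTAGCCTAGTTTTTAG"
print(ag_scanning_feature(seq, len(seq) - 2))  # → 2
```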
Efficient Feature Selection and Multiclass Classification with Integrated Instance and Model Based Learning
Multiclass classification and feature (variable) selection are commonly encountered in many biological and medical applications. However, extending binary classification approaches to multiclass problems is not trivial. Instance-based methods such as K-nearest neighbors (KNN) extend naturally to multiclass problems and usually perform well with unbalanced data, but they suffer from the curse of dimensionality: their performance degrades on high-dimensional data. Model-based methods such as logistic regression, on the other hand, require decomposing the multiclass problem into several binary problems with one-vs.-one or one-vs.-rest schemes. Although they can be applied to high-dimensional data with L1- or Lp-penalized methods, such approaches can only select features independently, and the features selected for different binary problems are usually different. The one-vs.-rest scheme also produces unbalanced binary classification problems even when the original multiclass problem is balanced.
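To make the contrast concrete, here is a minimal sketch on toy data (not from the paper): a plain KNN vote handles any number of classes directly, while a one-vs.-rest decomposition of a balanced three-class problem produces unbalanced 30-vs.-60 binary subproblems:

```python
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    # plain k-nearest-neighbour majority vote: works unchanged
    # for any number of classes
    order = sorted(range(len(X_train)),
                   key=lambda i: sum((a - b) ** 2
                                     for a, b in zip(X_train[i], x)))
    return Counter(y_train[i] for i in order[:k]).most_common(1)[0][0]

X = [(0, 0), (0, 1), (5, 5), (5, 6), (9, 0), (9, 1)]
y = ["A", "A", "B", "B", "C", "C"]
print(knn_predict(X, y, (5, 5.5)))  # → B

# one-vs.-rest turns a balanced 3 x 30 problem into 30-vs.-60 ones
labels = ["A"] * 30 + ["B"] * 30 + ["C"] * 30
for cls in "ABC":
    counts = Counter("pos" if lab == cls else "neg" for lab in labels)
    print(cls, dict(counts))  # each subproblem: 30 pos vs. 60 neg
```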
Large-scale Nonlinear Variable Selection via Kernel Random Features
We propose a new method for input variable selection in nonlinear regression. The method is embedded into a kernel regression machine that can model general nonlinear functions, not being a priori limited to additive models. This is the first kernel-based variable selection method applicable to large datasets. It sidesteps the typical poor scaling properties of kernel methods by mapping the inputs into a relatively low-dimensional space of random features. The algorithm discovers the variables relevant for the regression task together with learning the prediction model, through learning the appropriate nonlinear random feature maps. We demonstrate the outstanding performance of our method on a set of large-scale synthetic and real datasets.
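The scaling trick (a random feature map approximating a kernel) can be sketched with standard random Fourier features. The per-variable scaling vector `s` below stands in for the learned relevance parameters and is an illustrative assumption; this is not the paper's training procedure, just the mechanism it builds on:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_fourier_features(X, W, b):
    # phi(x) = sqrt(2/D) * cos(x W + b) approximates an RBF kernel
    D = W.shape[1]
    return np.sqrt(2.0 / D) * np.cos(X @ W + b)

# toy data: y depends only on the first of 5 inputs
n, d, D = 200, 5, 100
X = rng.normal(size=(n, d))
y = np.sin(X[:, 0])

# per-variable relevance weights s scale each input dimension;
# driving s_j toward zero removes variable j from the learned function
s = np.ones(d)
W = rng.normal(size=(d, D))
b = rng.uniform(0, 2 * np.pi, size=D)

Phi = random_fourier_features(X * s, W, b)     # n x D instead of n x n
# ridge regression in the random feature space (cost grows with D, not n^2)
alpha = np.linalg.solve(Phi.T @ Phi + 1e-3 * np.eye(D), Phi.T @ y)
pred = Phi @ alpha
print(float(np.mean((pred - y) ** 2)))         # small training MSE
```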
Estimation of Relevant Variables on High-Dimensional Biological Patterns Using Iterated Weighted Kernel Functions
BACKGROUND
The analysis of complex proteomic and genomic profiles involves the identification of significant markers within a set of hundreds or even thousands of variables that represent a high-dimensional problem space. The occurrence of noise, redundancy or combinatorial interactions in the profile makes the selection of relevant variables harder.
METHODOLOGY/PRINCIPAL FINDINGS
Here we propose a method to select variables based on estimated relevance to hidden patterns. Our method combines a weighted-kernel discriminant with an iterative stochastic probability estimation algorithm to discover the relevance distribution over the set of variables. We verified the ability of our method to select predefined relevant variables in synthetic proteome-like data and then assessed its performance on biological high-dimensional problems. Experiments were run on serum proteomic datasets of infectious diseases. The resulting variable subsets achieved classification accuracies of 99% on Human African Trypanosomiasis, 91% on Tuberculosis, and 91% on Malaria serum proteomic profiles with fewer than 20% of variables selected. Our method scaled up to dimensionalities several orders of magnitude higher, as shown with gene expression microarray datasets on which we obtained classification accuracies close to 90% with fewer than 1% of the total number of variables.
CONCLUSIONS
Our method consistently found relevant variables attaining high classification accuracies across synthetic and biological datasets. Notably, it yielded very compact subsets compared to the original number of variables, which should simplify downstream biological experimentation.
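A weighted kernel of the kind such methods build on can be illustrated as follows; the weighting scheme here is a generic sketch, not the paper's iterative estimation algorithm:

```python
import numpy as np

def weighted_rbf(x, z, w):
    # RBF kernel with a nonnegative relevance weight per variable:
    # a variable with w_j ≈ 0 stops influencing the distance entirely
    return np.exp(-np.sum(w * (x - z) ** 2))

x = np.array([1.0, 2.0, 3.0])
z = np.array([1.0, 0.0, 3.0])
w = np.array([1.0, 0.0, 1.0])   # variable 2 deemed irrelevant

# with variable 2 masked out, x and z look identical to the kernel
print(weighted_rbf(x, z, w))                       # → 1.0
print(weighted_rbf(x, z, np.ones(3)))              # exp(-4), much smaller
```

An iterative relevance estimator would repeatedly adjust `w` and re-evaluate the discriminant, concentrating weight on the variables that improve class separation.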
Top scoring pairs for feature selection in machine learning and applications to cancer outcome prediction
BACKGROUND
The widely used k top scoring pair (k-TSP) algorithm is a simple yet powerful parameter-free classifier. It owes its success in many cancer microarray datasets to an effective feature selection algorithm that is based on relative expression ordering of gene pairs. However, its general robustness does not extend to some difficult datasets, such as those involving cancer outcome prediction, which may be due to the relatively simple voting scheme used by the classifier. We believe that the performance can be enhanced by separating its effective feature selection component and combining it with a powerful classifier such as the support vector machine (SVM). More generally, the top scoring pairs generated by the k-TSP ranking algorithm can be used as a dimensionally reduced subspace for other machine learning classifiers.
RESULTS
We developed an approach integrating the k-TSP ranking algorithm (TSP) with other machine learning methods, allowing combination of the computationally efficient, multivariate feature ranking of k-TSP with multivariate classifiers such as SVM. We evaluated this hybrid scheme (k-TSP+SVM) in a range of simulated datasets with known data structures. Compared with other feature selection methods, such as a univariate method similar to Fisher's discriminant criterion (Fisher) or recursive feature elimination embedded in SVM (RFE), TSP is increasingly more effective than the other two methods as the informative genes become progressively more correlated, which is demonstrated both in terms of classification performance and the ability to recover true informative genes. We also applied this hybrid scheme to four cancer prognosis datasets, in which k-TSP+SVM outperforms the k-TSP classifier in all datasets and achieves either comparable or superior performance to that of SVM alone. In concurrence with what is observed in simulation, TSP appears to be a better feature selector than Fisher and RFE in some of the cancer datasets.
CONCLUSIONS
The k-TSP ranking algorithm can be used as a computationally efficient, multivariate filter method for feature selection in machine learning. SVM in combination with the k-TSP ranking algorithm outperforms k-TSP and SVM alone in simulated datasets and in some cancer prognosis datasets. Simulation studies suggest that, as a feature selector, TSP is better tuned to certain data characteristics, i.e., correlations among informative genes, which makes it potentially interesting as an alternative feature ranking method in pathway analysis.
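The pair ranking is based on relative expression ordering: for a gene pair (i, j), score how differently the event "gene i is expressed below gene j" behaves in the two classes. A minimal sketch of that score on toy data (tie handling and the secondary ranking criterion used by k-TSP are omitted):

```python
def tsp_score(expr, labels, i, j):
    """Top-scoring-pair score for genes i and j:
    |P(X_i < X_j | class 0) - P(X_i < X_j | class 1)|.
    expr: per-sample expression vectors; labels: 0/1 per sample."""
    def frac(cls):
        samples = [x for x, y in zip(expr, labels) if y == cls]
        return sum(1 for x in samples if x[i] < x[j]) / len(samples)
    return abs(frac(0) - frac(1))

# toy data: the ordering of genes 0 and 1 flips between the classes
expr = [[1, 2], [1, 3], [3, 1], [4, 1]]
labels = [0, 0, 1, 1]
print(tsp_score(expr, labels, 0, 1))  # → 1.0, a perfectly discriminative pair
```

Because only the within-sample ordering matters, the score is invariant to monotone per-sample normalization, which is one reason the ranking travels well across microarray datasets.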
Individualized markers optimize class prediction of microarray data
BACKGROUND
Identification of molecular markers for the classification of microarray data is a challenging task. Despite the evident dissimilarity in various characteristics of biological samples belonging to the same category, most marker-selection and classification methods do not consider this variability. In general, feature selection methods aim at identifying a common set of genes whose combined expression profiles can accurately predict the category of all samples. Here, we argue that this simplified approach is often unable to capture the complexity of a disease phenotype, and we propose an alternative method that takes into account the individuality of each patient sample.
RESULTS
Instead of using the same features for the classification of all samples, the proposed technique starts by creating a pool of informative gene features. For each sample, the method selects a subset of these features whose expression profiles are most likely to accurately predict the sample's category. Different subsets are utilized for different samples, and the outcomes are combined in a hierarchical framework for the classification of all samples. Moreover, this approach can innately identify subgroups of samples within a given class which share common feature sets, thus highlighting the effect of individuality on gene expression.
CONCLUSION
In addition to high classification accuracy, the proposed method offers a more individualized approach for the identification of biological markers, which may help in better understanding the molecular background of a disease and emphasizes the need for more flexible medical interventions.
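One illustrative way to read "different feature subsets for different samples" is the following sketch, which is an assumed instantiation for illustration only, not the authors' algorithm: for each test sample, keep only the k pooled features where that sample deviates most from the between-class midpoint, then classify with the nearest class centroid on that per-sample subset:

```python
import numpy as np

def individualized_predict(x, c0, c1, k=3):
    """Hypothetical per-sample feature selection (not the paper's method):
    keep the k features where this sample is farthest from the midpoint
    of the two class centroids, then do nearest-centroid on that subset."""
    midpoint = (c0 + c1) / 2.0
    idx = np.argsort(np.abs(x - midpoint))[-k:]   # sample-specific subset
    d0 = np.sum((x[idx] - c0[idx]) ** 2)
    d1 = np.sum((x[idx] - c1[idx]) ** 2)
    return 0 if d0 < d1 else 1

c0 = np.zeros(5)                       # centroid of class 0
c1 = np.array([2.0, 2.0, 0, 0, 0])     # centroid of class 1
print(individualized_predict(np.array([2.0, 2.0, 0.1, 0, 0]), c0, c1))  # → 1
print(individualized_predict(np.array([0.0, 0.0, 0.1, 0, 0]), c0, c1))  # → 0
```

Two samples from the same class can end up classified on different feature subsets, which is the individuality the abstract emphasizes.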
Accurate molecular classification of cancer using simple rules
BACKGROUND
One intractable problem with using microarray data analysis for cancer classification is how to reduce the extremely high-dimensional gene feature data to remove the effects of noise. Feature selection is often used to address this problem by selecting informative genes from among thousands or tens of thousands of genes. However, most existing methods of microarray-based cancer classification utilize too many genes to achieve accurate classification, which often hampers the interpretability of the models. For a better understanding of the classification results, it is desirable to develop simpler rule-based models with as few marker genes as possible.
METHODS
We screened a small number of informative single genes and gene pairs on the basis of their depended degrees, a measure proposed in rough set theory. Applying the decision rules induced by the selected genes or gene pairs, we constructed cancer classifiers. We tested the efficacy of the classifiers by leave-one-out cross-validation (LOOCV) of training sets and classification of independent test sets.
RESULTS
We applied our methods to five cancerous gene expression datasets: leukemia (acute lymphoblastic leukemia [ALL] vs. acute myeloid leukemia [AML]), lung cancer, prostate cancer, breast cancer, and leukemia (ALL vs. mixed-lineage leukemia [MLL] vs. AML). Accurate classification outcomes were obtained by utilizing just one or two genes. Some genes that correlated closely with the pathogenesis of the relevant cancers were identified. In terms of both classification performance and algorithm simplicity, our approach outperformed or at least matched existing methods.
CONCLUSION
In cancerous gene expression datasets, a small number of genes, even one or two if selected correctly, is capable of achieving an ideal cancer classification effect. This finding also means that very simple rules may perform well for cancerous class prediction.
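The "one gene, one cut-off" flavour of such rules can be sketched as an exhaustive search for the single-threshold rule with the best training accuracy. This is a generic sketch of rule induction on one gene; the paper's depended-degree criterion from rough set theory is not reproduced here:

```python
def best_single_gene_rule(expr, labels):
    """Search every gene and every candidate cut-off; return the
    (gene, threshold, accuracy) of the best simple rule
    'predict class 1 if expression > threshold'."""
    n_genes = len(expr[0])
    best = (None, None, 0.0)
    for g in range(n_genes):
        for t in sorted(set(x[g] for x in expr)):
            acc = sum((x[g] > t) == bool(y)
                      for x, y in zip(expr, labels)) / len(expr)
            if acc > best[2]:
                best = (g, t, acc)
    return best

# toy two-class data separable on the single gene 0
expr = [[0.1], [0.2], [0.9], [1.0]]
labels = [0, 0, 1, 1]
print(best_single_gene_rule(expr, labels))  # → (0, 0.2, 1.0)
```

A rule this small is trivially interpretable, which is exactly the property the abstract argues for; in practice the threshold should of course be validated out of sample (e.g. by LOOCV as in the paper).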